Simulation-Based Calibration

Foundational Report 6

Categories: foundations, validation, m_0, m_1

SBC validation comparing parameter identification between m_0 and m_1, providing further evidence that the addition of risky choices improves identification of δ.

Author: Jeff Helzner
Published: February 27, 2026

0.1 Introduction

In Report 4, we observed poor recovery of the utility increment parameters δ in model m_0, and in Report 5, we hypothesized that adding risky choices would improve identification. The parameter recovery experiments in Report 5 suggested improvement, but with only 20 iterations, the results may not be statistically robust.

Simulation-Based Calibration (SBC) provides a more principled approach to assess parameter identification. Rather than asking “can we recover specific true values?” (as in parameter recovery), SBC asks: “does the posterior correctly represent uncertainty about the parameters?” That is, if we repeatedly (i) draw parameters from the prior, (ii) simulate data given those parameters, and (iii) compute the posterior, then the posterior should, on average, be calibrated—assigning the correct probability to regions of parameter space. SBC operationalizes this idea by checking whether the true parameter value occupies a uniform rank position within the posterior samples; systematic departures from uniformity signal that the posterior is too wide, too narrow, or shifted relative to the data-generating process.

Note: The SBC Principle

If the inference algorithm is correct and the model is identified, then:

  1. Draw θ* from the prior: \(\theta^* \sim p(\theta)\)
  2. Simulate data: \(y \sim p(y | \theta^*)\)
  3. Compute posterior: \(p(\theta | y)\)
  4. Calculate the rank of θ* in the posterior samples

The distribution of thinned ranks should be uniform if the model is well-calibrated. (Thinning is necessary because HMC produces dependent posterior samples; by retaining only every \(t\)-th draw, we reduce autocorrelation and better approximate the independence assumption underlying the rank statistic.) Non-uniformity indicates problems with either the inference algorithm or parameter identification.

Following the Stan User’s Guide (§10.1), we set the number of post-thinned posterior draws \(L\) and the number of SBC iterations \(N\) to be equal, both 999, so that the rank statistic takes one of \(L + 1 = 1000\) possible values. We use a single MCMC chain to avoid any between-chain alignment issues and thin by a factor of 4 to reduce autocorrelation.
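To make the four steps concrete, the loop below runs SBC on a toy normal-normal model where the exact posterior is available in closed form, so no MCMC (and hence no thinning) is required; the model and all names are illustrative, not the report's Stan models.

```python
import numpy as np

rng = np.random.default_rng(42)
n_sims, n_draws, n_obs = 999, 999, 20
mu0, tau0 = 0.0, 1.0     # prior: mu ~ Normal(mu0, tau0)
sigma = 1.0              # known observation noise

ranks = np.empty(n_sims, dtype=int)
for i in range(n_sims):
    mu_true = rng.normal(mu0, tau0)                 # 1. draw from the prior
    y = rng.normal(mu_true, sigma, size=n_obs)      # 2. simulate data
    prec = 1 / tau0**2 + n_obs / sigma**2           # 3. exact conjugate posterior
    post_mean = (mu0 / tau0**2 + y.sum() / sigma**2) / prec
    draws = rng.normal(post_mean, np.sqrt(1 / prec), size=n_draws)
    ranks[i] = np.sum(draws < mu_true)              # 4. rank of the true value

# With exact inference, ranks are discrete-uniform on {0, ..., n_draws}
print(f"mean rank = {ranks.mean():.1f} (uniform expectation: {n_draws / 2:.1f})")
```

Because inference here is exact, the resulting ranks are calibrated by construction; with a Stan model, the closed-form posterior would be replaced by thinned MCMC draws.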

0.2 SBC Methodology

0.2.1 Rank Statistics

For each parameter θ, we compute the rank of the true value θ* within the posterior samples {θ₁, θ₂, …, θₛ}:

\[ \text{rank}(\theta^*) = \sum_{s=1}^{S} \mathbf{1}[\theta_s < \theta^*] \]

If the posterior is calibrated, this rank follows a discrete uniform distribution on {0, 1, …, S}.
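In code, the rank statistic is a single comparison-and-sum (the values below are hypothetical):

```python
import numpy as np

def sbc_rank(theta_true, posterior_samples):
    """Rank of the true value: the number of posterior draws strictly below it."""
    return int(np.sum(posterior_samples < theta_true))

# With S = 3 draws the rank takes a value in {0, 1, 2, 3}
print(sbc_rank(2.5, np.array([1.0, 2.0, 3.0])))  # 2 draws fall below 2.5
```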

0.2.2 Diagnostics

We use several complementary diagnostics to assess calibration. Each provides a different lens on the same underlying question—whether the posterior is well-calibrated—so that no single summary statistic drives our conclusions.

  1. Rank histograms: A visual check for uniformity. Each bin of the histogram counts how many SBC simulations produced a thinned rank in a given range. If the posterior is calibrated, all bins should contain roughly the same number of counts (i.e., the histogram should be approximately flat). Systematic patterns indicate specific calibration failures: a U-shape signals underdispersed (overconfident) posteriors, with true values landing in the tails, while an inverted-U signals overdispersed posteriors, with true values clustering near the center.

  2. ECDF plots: The empirical cumulative distribution function of the normalized ranks is plotted against the CDF of a Uniform(0, 1) distribution (the diagonal line). Departures from the diagonal reveal the same calibration issues as the rank histogram, but can be easier to read when the number of SBC simulations is small.

  3. Chi-square goodness-of-fit tests: A formal test of whether the observed bin counts in the rank histogram are consistent with a uniform distribution. The test statistic \(\chi^2 = \sum_b (O_b - E_b)^2 / E_b\) is compared to a \(\chi^2\) distribution with \(B - 1\) degrees of freedom (where \(B\) is the number of bins). A small \(p\)-value (e.g., \(p < 0.05\)) suggests the ranks are not uniform. With 999 SBC simulations and 20 bins, the expected count per bin is approximately 50, which is well within the range where the chi-square approximation is reliable.

  4. Kolmogorov-Smirnov (KS) tests: A non-parametric test comparing the rank ECDF to the Uniform(0, 1) CDF. The KS statistic equals the maximum vertical distance between the two curves. Like the chi-square test, a small \(p\)-value suggests non-uniformity. The KS test does not require binning, but can behave conservatively with discrete rank data.
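Both formal tests can be applied to a vector of SBC ranks in a few lines. This is a sketch using scipy, shown on synthetic ranks that are uniform by construction, so neither test should reject:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
L = 999                                    # post-thinned draws per simulation
ranks = rng.integers(0, L + 1, size=999)   # stand-in for calibrated SBC ranks

# Chi-square test: bin the ranks and compare observed to expected counts
counts, _ = np.histogram(ranks, bins=20, range=(0, L + 1))
chi2, p_chi2 = stats.chisquare(counts)

# KS test: compare normalized ranks to the Uniform(0, 1) CDF
ks, p_ks = stats.kstest(ranks / L, "uniform")

print(f"chi-square: stat = {chi2:.2f}, p = {p_chi2:.3f}")
print(f"KS:         stat = {ks:.3f}, p = {p_ks:.3f}")
```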

0.2.3 Study Design

We use matched study designs for m_0 and m_1 to enable fair comparison:

Show code
# Study design configurations
config_m0 = {
    "M": 25,                    # Number of uncertain decision problems
    "K": 3,                     # Number of consequences
    "D": 5,                     # Feature dimensions
    "R": 15,                    # Distinct alternatives
    "min_alts_per_problem": 2,
    "max_alts_per_problem": 5,
    "feature_dist": "normal",
    "feature_params": {"loc": 0, "scale": 1}
}

config_m1 = {
    **config_m0,                # Same uncertain problem structure
    "N": 25,                    # Risky problems (matching M)
    "S": 15,                    # Risky alternatives (matching R)
}

print("Study Design Comparison:")
print(f"\nm_0 (Uncertain Only):")
print(f"  M = {config_m0['M']} decision problems")
print(f"  Total choices: ~{config_m0['M'] * 3.5:.0f}")

print(f"\nm_1 (Uncertain + Risky):")
print(f"  M = {config_m1['M']} uncertain + N = {config_m1['N']} risky")
print(f"  Total choices: ~{(config_m1['M'] + config_m1['N']) * 3.5:.0f}")
Study Design Comparison:

m_0 (Uncertain Only):
  M = 25 decision problems
  Total choices: ~88

m_1 (Uncertain + Risky):
  M = 25 uncertain + N = 25 risky
  Total choices: ~175
Show code
# SBC configuration following Stan User's Guide §10.1 recommendations.
# With L post-thinned draws per simulation, the rank statistic takes values
# in {0, 1, ..., L}, giving L + 1 possible values.  Setting L = N_sbc = 999
# means the expected rank histogram is discrete-uniform on 1000 values;
# with 20 bins we get ~50 counts per bin—well within the chi-square
# approximation's comfort zone—and the KS confidence band shrinks to
# ε ≈ 0.043, roughly 3× narrower than the previous n = 100 design.

n_sbc_sims = 999            # Number of SBC iterations (N)
n_mcmc_chains = 1           # Single chain (standard for SBC)
thin = 4                    # Thinning factor
n_mcmc_samples = thin * n_sbc_sims   # 3996 total draws → 999 post-thinned
max_rank = n_mcmc_samples // thin     # 999 (= L)
n_bins = 20                 # Histogram bins
expected_per_bin = n_sbc_sims / n_bins  # ~49.95

print("SBC Configuration:")
print(f"  Simulations (N):        {n_sbc_sims}")
print(f"  MCMC samples per sim:   {n_mcmc_samples}")
print(f"  Chains:                 {n_mcmc_chains}")
print(f"  Thinning factor:        {thin}")
print(f"  Post-thinned draws (L): {max_rank}")
print(f"  Possible rank values:   {max_rank + 1}")
print(f"  Histogram bins:         {n_bins}")
print(f"  Expected counts/bin:    {expected_per_bin:.1f}")
SBC Configuration:
  Simulations (N):        999
  MCMC samples per sim:   3996
  Chains:                 1
  Thinning factor:        4
  Post-thinned draws (L): 999
  Possible rank values:   1000
  Histogram bins:         20
  Expected counts/bin:    50.0
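The role of the thinning factor can be illustrated on a synthetic autocorrelated chain: an AR(1) process standing in for MCMC output (the value rho = 0.8 is arbitrary), thinned by the same factor of 4 used in the configuration above.

```python
import numpy as np

rng = np.random.default_rng(3)
n, rho = 4000, 0.8
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    # AR(1) chain with stationary unit variance, mimicking dependent MCMC draws
    x[t] = rho * x[t - 1] + rng.normal() * np.sqrt(1 - rho**2)

def lag1_autocorr(z):
    z = z - z.mean()
    return float(np.dot(z[:-1], z[1:]) / np.dot(z, z))

thinned = x[::4]   # keep every 4th draw, as in the SBC configuration
print(f"lag-1 autocorr: full = {lag1_autocorr(x):.2f}, "
      f"thinned = {lag1_autocorr(thinned):.2f}")
```

Thinning by 4 drops the lag-1 autocorrelation from roughly rho to roughly rho^4, bringing the retained draws closer to the independence assumed by the rank statistic.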

0.3 Running SBC for m_0

First, we run SBC for model m_0 (uncertain choices only) to establish a baseline:

Show code
from utils.study_design import StudyDesign
from analysis.sbc import SimulationBasedCalibration

# Create m_0 study design
study_m0 = StudyDesign(
    M=config_m0["M"],
    K=config_m0["K"],
    D=config_m0["D"],
    R=config_m0["R"],
    min_alts_per_problem=config_m0["min_alts_per_problem"],
    max_alts_per_problem=config_m0["max_alts_per_problem"],
    feature_dist=config_m0["feature_dist"],
    feature_params=config_m0["feature_params"],
    design_name="sbc_m0"
)
study_m0.generate()
Show code
# Create output directory
import os
import tempfile

output_dir_m0 = tempfile.mkdtemp(prefix="sbc_m0_")

# Run SBC for m_0
sbc_m0 = SimulationBasedCalibration(
    sbc_model_path=os.path.join(project_root, "models", "m_0_sbc.stan"),
    study_design=study_m0,
    output_dir=output_dir_m0,
    n_sbc_sims=n_sbc_sims,
    n_mcmc_samples=n_mcmc_samples,
    n_mcmc_chains=n_mcmc_chains,
    thin=thin
)

ranks_m0, true_params_m0 = sbc_m0.run()
m_0 SBC Complete:
  Simulations: 999
  Parameters: 18

0.4 Running SBC for m_1

Now we run SBC for model m_1 (uncertain + risky choices):

Show code
from utils.study_design_m1 import StudyDesignM1

# Create m_1 study design
study_m1 = StudyDesignM1(
    M=config_m1["M"],
    N=config_m1["N"],
    K=config_m1["K"],
    D=config_m1["D"],
    R=config_m1["R"],
    S=config_m1["S"],
    min_alts_per_problem=config_m1["min_alts_per_problem"],
    max_alts_per_problem=config_m1["max_alts_per_problem"],
    risky_probs="random",
    feature_dist=config_m1["feature_dist"],
    feature_params=config_m1["feature_params"],
    design_name="sbc_m1"
)
study_m1.generate()
Show code
# Create output directory  
output_dir_m1 = tempfile.mkdtemp(prefix="sbc_m1_")

# Run SBC for m_1
sbc_m1 = SimulationBasedCalibration(
    sbc_model_path=os.path.join(project_root, "models", "m_1_sbc.stan"),
    study_design=study_m1,
    output_dir=output_dir_m1,
    n_sbc_sims=n_sbc_sims,
    n_mcmc_samples=n_mcmc_samples,
    n_mcmc_chains=n_mcmc_chains,
    thin=thin
)

ranks_m1, true_params_m1 = sbc_m1.run()
m_1 SBC Complete:
  Simulations: 999
  Parameters: 18

0.5 Comparing Rank Distributions

The key diagnostic is comparing rank distributions between m_0 and m_1. For well-calibrated parameters, ranks should be uniformly distributed.

0.5.1 α (Sensitivity Parameter)

Show code
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

K, D = config_m0['K'], config_m0['D']

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# m_0 alpha ranks (index 0)
ax = axes[0]
counts_m0, bins_m0, _ = ax.hist(ranks_m0[:, 0], bins=n_bins, alpha=0.7, 
                                 color='steelblue', edgecolor='white')
# Expected count per bin for uniform distribution
expected_count = len(ranks_m0) / n_bins
ax.axhline(y=expected_count, color='red', linestyle='--', linewidth=2, 
           label=f'Uniform (E={expected_count:.1f})')
ax.set_xlabel('Rank', fontsize=11)
ax.set_ylabel('Count', fontsize=11)
ax.set_title('m_0: α Rank Distribution', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)

# m_1 alpha ranks (index 0)
ax = axes[1]
counts_m1, bins_m1, _ = ax.hist(ranks_m1[:, 0], bins=n_bins, alpha=0.7, 
                                 color='forestgreen', edgecolor='white')
ax.axhline(y=expected_count, color='red', linestyle='--', linewidth=2, 
           label=f'Uniform (E={expected_count:.1f})')
ax.set_xlabel('Rank', fontsize=11)
ax.set_ylabel('Count', fontsize=11)
ax.set_title('m_1: α Rank Distribution', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Chi-square tests
chi2_m0_alpha, p_m0_alpha = stats.chisquare(np.histogram(ranks_m0[:, 0], bins=n_bins)[0])
chi2_m1_alpha, p_m1_alpha = stats.chisquare(np.histogram(ranks_m1[:, 0], bins=n_bins)[0])
print(f"\nα Uniformity Tests:")
print(f"  m_0: χ² = {chi2_m0_alpha:.2f}, p = {p_m0_alpha:.3f}")
print(f"  m_1: χ² = {chi2_m1_alpha:.2f}, p = {p_m1_alpha:.3f}")
Figure 1: SBC rank distributions for α. With 999 SBC simulations the chi-square test is well-powered. See the printed test statistics below for the exact values in this run.

α Uniformity Tests:
  m_0: χ² = 17.76, p = 0.539
  m_1: χ² = 22.96, p = 0.239
Note: Interpreting the α results

Both models are expected to identify α reasonably well, since sensitivity influences all choice probabilities. With 999 SBC simulations the chi-square test has adequate power to detect moderate departures from uniformity, so the p-values should be taken at face value: a rejection at the 0.05 level is meaningful evidence of miscalibration. If m_0 shows a notably lower p-value for α than m_1, that could indicate a mild calibration issue stemming from the β–δ identification problem propagating to α estimates; the parameter recovery results in Report 4 are relevant context here.

The δ parameters are the critical comparison. In m_0, the β–δ interaction makes δ difficult to identify from uncertain choices alone (Report 4); in m_1, risky choices supply direct information about utilities that should improve δ calibration:

Show code
# Delta parameters are after alpha (1) and beta (K*D)
delta_start_idx = 1 + K * D
K_minus_1 = K - 1

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Expected count per bin for uniform distribution
expected_count = len(ranks_m0) / n_bins

for k in range(K_minus_1):
    param_idx = delta_start_idx + k
    
    # m_0
    ax = axes[0, k]
    counts_m0_delta, _, _ = ax.hist(ranks_m0[:, param_idx], bins=n_bins, alpha=0.7, 
                                     color='steelblue', edgecolor='white')
    ax.axhline(y=expected_count, color='red', linestyle='--', linewidth=2, 
               label=f'Uniform (E={expected_count:.1f})')
    ax.set_xlabel('Rank', fontsize=11)
    ax.set_ylabel('Count', fontsize=11)
    ax.set_title(f'm_0: δ_{k+1} Rank Distribution', fontsize=12)
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # m_1
    ax = axes[1, k]
    counts_m1_delta, _, _ = ax.hist(ranks_m1[:, param_idx], bins=n_bins, alpha=0.7,
                                     color='forestgreen', edgecolor='white')
    ax.axhline(y=expected_count, color='red', linestyle='--', linewidth=2, 
               label=f'Uniform (E={expected_count:.1f})')
    ax.set_xlabel('Rank', fontsize=11)
    ax.set_ylabel('Count', fontsize=11)
    ax.set_title(f'm_1: δ_{k+1} Rank Distribution', fontsize=12)
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Chi-square and KS tests for each δ parameter; results are kept in
# delta_chi2_results for reuse in the summary table below
delta_chi2_results = []
print("δ Parameter Uniformity Tests:")
print("-" * 50)
for k in range(K_minus_1):
    param_idx = delta_start_idx + k
    chi2_m0, p_m0 = stats.chisquare(np.histogram(ranks_m0[:, param_idx], bins=n_bins)[0])
    chi2_m1, p_m1 = stats.chisquare(np.histogram(ranks_m1[:, param_idx], bins=n_bins)[0])
    ks_m0, ks_p_m0 = stats.kstest(ranks_m0[:, param_idx] / max_rank, 'uniform')
    ks_m1, ks_p_m1 = stats.kstest(ranks_m1[:, param_idx] / max_rank, 'uniform')
    delta_chi2_results.append({'m0_chi2': chi2_m0, 'm0_p': p_m0,
                               'm1_chi2': chi2_m1, 'm1_p': p_m1})
    print(f"\nδ_{k+1}:")
    print(f"  m_0: χ² = {chi2_m0:.2f}, p = {p_m0:.3f} | KS = {ks_m0:.3f}, p = {ks_p_m0:.3f}")
    print(f"  m_1: χ² = {chi2_m1:.2f}, p = {p_m1:.3f} | KS = {ks_m1:.3f}, p = {ks_p_m1:.3f}")
Figure 2: SBC rank distributions for δ parameters. Compare the shape of the rank histograms across models: departures from uniformity indicate poor calibration. See the formal test results below for quantitative assessment.
δ Parameter Uniformity Tests:
--------------------------------------------------

δ_1:
  m_0: χ² = 14.79, p = 0.736 | KS = 0.022, p = 0.709
  m_1: χ² = 24.08, p = 0.193 | KS = 0.024, p = 0.603

δ_2:
  m_0: χ² = 14.79, p = 0.736 | KS = 0.022, p = 0.709
  m_1: χ² = 24.08, p = 0.193 | KS = 0.024, p = 0.603
Note: Interpreting the δ results

Calibration and identification are related but distinct concepts. Calibration refers to whether the posterior accurately represents uncertainty (uniform SBC ranks). Identification refers to whether the data are informative enough to pin down the parameter. Poor identification leads to poor calibration in SBC, because when the data are uninformative the posterior fails to update appropriately from the prior, which manifests as non-uniform ranks (often a central peak, indicating the posterior is too diffuse).

When reading the test statistics, note the following caveats:

  • Chi-square p-values depend on the particular binning. With 999 simulations and 20 bins, expected counts per bin are approximately 50, so the chi-square approximation is reliable and the test has good power to detect moderate departures from uniformity.
  • KS test values should in principle differ between models. If they appear identical or very similar, check whether the rank distributions actually differ visually in the histograms and ECDF plots. The KS test can behave conservatively with discrete rank data (since it was designed for continuous distributions), which may reduce its sensitivity to real differences between models.
  • The rank histograms and ECDF plots are the primary diagnostics; the formal tests supplement but do not replace visual inspection.

0.5.2 ECDF Comparison

The Empirical Cumulative Distribution Function (ECDF) provides another view of calibration. For well-calibrated parameters, the ECDF should follow the diagonal:

Show code
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for k in range(K_minus_1):
    param_idx = delta_start_idx + k
    
    # Normalize ranks to [0, 1] using max_rank (effective samples after thinning)
    ranks_norm_m0 = np.sort(ranks_m0[:, param_idx]) / max_rank
    ranks_norm_m1 = np.sort(ranks_m1[:, param_idx]) / max_rank
    
    ecdf = np.arange(1, len(ranks_norm_m0) + 1) / len(ranks_norm_m0)
    
    ax = axes[k]
    ax.step(ranks_norm_m0, ecdf, where='post', label='m_0', color='steelblue', linewidth=2)
    ax.step(ranks_norm_m1, ecdf, where='post', label='m_1', color='forestgreen', linewidth=2)
    ax.plot([0, 1], [0, 1], 'r--', linewidth=2, label='Uniform')
    
    # Add 95% confidence band
    n = len(ecdf)
    epsilon = np.sqrt(np.log(2/0.05) / (2 * n))  # Kolmogorov-Smirnov band
    ax.fill_between([0, 1], [0-epsilon, 1-epsilon], [0+epsilon, 1+epsilon], 
                    alpha=0.2, color='red', label='95% CI')
    
    ax.set_xlabel('Normalized Rank', fontsize=11)
    ax.set_ylabel('ECDF', fontsize=11)
    ax.set_title(f'δ_{k+1} ECDF Comparison', fontsize=12)
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)

plt.tight_layout()
plt.show()
Figure 3: ECDF plots for δ parameters. The closer the curve to the diagonal, the better the calibration. The shaded band shows the 95% Kolmogorov-Smirnov confidence region for n=999 simulations. Compare the m_0 and m_1 curves to assess whether adding risky choices improves δ calibration.

The 95% confidence band width is \(\epsilon \approx 0.043\) for \(n=999\) simulations, roughly three times narrower than the \(\epsilon \approx 0.136\) that would obtain with only 100 iterations. This gives us sensitivity to detect relatively small departures from uniformity.
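The band half-width quoted above follows from \(\epsilon = \sqrt{\ln(2/\alpha) / (2n)}\), the same formula used in the plotting code; a quick check of both figures:

```python
import numpy as np

def ks_band(n, alpha=0.05):
    """Half-width of the (1 - alpha) Kolmogorov-Smirnov confidence band."""
    return np.sqrt(np.log(2 / alpha) / (2 * n))

print(f"n = 999: eps = {ks_band(999):.3f}")   # ~0.043
print(f"n = 100: eps = {ks_band(100):.3f}")   # ~0.136
```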

0.5.3 β (Feature Weight Parameters)

Finally, we examine β calibration. Because β and δ interact multiplicatively in the expected utility (see the identification analysis in Report 5), poor identification of δ in m_0 can propagate to β, potentially degrading its calibration as well. In m_1, if risky choices successfully pin down δ, the β parameters may also benefit from improved calibration.

Show code
# Collect beta chi-square p-values for all K*D parameters
beta_start_idx = 1
beta_end_idx = 1 + K * D

beta_pvals_m0 = []
beta_pvals_m1 = []

for idx in range(beta_start_idx, beta_end_idx):
    counts_m0 = np.histogram(ranks_m0[:, idx], bins=n_bins)[0]
    counts_m1 = np.histogram(ranks_m1[:, idx], bins=n_bins)[0]
    
    _, p_m0 = stats.chisquare(counts_m0)
    _, p_m1 = stats.chisquare(counts_m1)
    
    beta_pvals_m0.append(p_m0)
    beta_pvals_m1.append(p_m1)

fig, ax = plt.subplots(figsize=(10, 5))

x = np.arange(K * D)
width = 0.35

bars1 = ax.bar(x - width/2, beta_pvals_m0, width, label='m_0', color='steelblue', alpha=0.7)
bars2 = ax.bar(x + width/2, beta_pvals_m1, width, label='m_1', color='forestgreen', alpha=0.7)

ax.axhline(y=0.05, color='red', linestyle='--', linewidth=2, label='α = 0.05')
ax.set_xlabel('β Parameter Index', fontsize=11)
ax.set_ylabel('Chi-square p-value', fontsize=11)
ax.set_title('β Parameter Calibration (p-values)', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print(f"\nβ Calibration Summary:")
print(f"  m_0: {np.sum(np.array(beta_pvals_m0) > 0.05)}/{K*D} parameters well-calibrated (p > 0.05)")
print(f"  m_1: {np.sum(np.array(beta_pvals_m1) > 0.05)}/{K*D} parameters well-calibrated (p > 0.05)")
Figure 4: Summary of β parameter SBC calibration. Because β and δ interact in the expected utility, the β–δ identification problem in m_0 may affect β calibration as well. Compare p-values across models to assess whether m_1’s improved δ identification also benefits β.

β Calibration Summary:
  m_0: 13/15 parameters well-calibrated (p > 0.05)
  m_1: 15/15 parameters well-calibrated (p > 0.05)

0.6 Summary Statistics

Show code
import pandas as pd

# Build summary table
summary_rows = []

# Alpha
counts_m0 = np.histogram(ranks_m0[:, 0], bins=n_bins)[0]
counts_m1 = np.histogram(ranks_m1[:, 0], bins=n_bins)[0]
chi2_m0, p_m0 = stats.chisquare(counts_m0)
chi2_m1, p_m1 = stats.chisquare(counts_m1)

summary_rows.append({
    'Parameter': 'α',
    'm_0 χ²': f'{chi2_m0:.2f}',
    'm_0 p-value': f'{p_m0:.3f}',
    'm_1 χ²': f'{chi2_m1:.2f}',
    'm_1 p-value': f'{p_m1:.3f}',
    'Improvement': '—' if p_m0 > 0.05 else ('✓' if p_m1 > 0.05 else '—')
})

# Beta (aggregated)
mean_p_m0 = np.mean(beta_pvals_m0)
mean_p_m1 = np.mean(beta_pvals_m1)
summary_rows.append({
    'Parameter': 'β (mean)',
    'm_0 χ²': '—',
    'm_0 p-value': f'{mean_p_m0:.3f}',
    'm_1 χ²': '—',
    'm_1 p-value': f'{mean_p_m1:.3f}',
    'Improvement': '—'
})

# Delta
for k in range(K_minus_1):
    result = delta_chi2_results[k]
    improvement = '✓' if (result['m0_p'] < 0.05 and result['m1_p'] > 0.05) else \
                  ('↑' if result['m1_p'] > result['m0_p'] else '—')
    summary_rows.append({
        'Parameter': f'δ_{k+1}',
        'm_0 χ²': f'{result["m0_chi2"]:.2f}',
        'm_0 p-value': f'{result["m0_p"]:.3f}',
        'm_1 χ²': f'{result["m1_chi2"]:.2f}',
        'm_1 p-value': f'{result["m1_p"]:.3f}',
        'Improvement': improvement
    })

summary_df = pd.DataFrame(summary_rows)
print(summary_df.to_string(index=False))
Table 1: SBC calibration comparison between m_0 and m_1. Higher p-values suggest better calibration (uniformity of rank distribution). With 999 simulations and ~50 expected counts per bin, the chi-square tests are well-powered.
Parameter m_0 χ² m_0 p-value m_1 χ² m_1 p-value Improvement
        α  17.76       0.539  22.96       0.239           —
 β (mean)      —       0.518      —       0.526           —
      δ_1  14.79       0.736  24.08       0.193           —
      δ_2  14.79       0.736  24.08       0.193           —

0.7 Discussion

0.7.1 Interpretation of Results

The SBC analysis compares calibration between m_0 and m_1. The key question is whether adding risky choices improves calibration of the δ parameters, which were poorly identified in m_0’s parameter recovery analysis (Report 4).

Tip: Key Finding: SBC Provides Evidence of Improved δ Calibration

Model m_0 (uncertain choices only):

  • α: Generally well-calibrated, though some runs may show borderline p-values (see the α discussion above)
  • β: Calibration may be affected by the β–δ identification interaction; check results against the parameter recovery findings in Report 4
  • δ: Expected to show poor calibration (non-uniform ranks reflecting identification problems)

Model m_1 (uncertain + risky choices):

  • α: Generally well-calibrated
  • β: Expected to benefit from improved δ identification
  • δ: Expected to show improved calibration (more uniform ranks)

The SBC results provide further evidence that adding risky choices improves δ calibration, consistent with the theoretical identification analysis in Report 5. With 999 SBC simulations, 999 post-thinned posterior draws per simulation, and ~50 expected counts per histogram bin, the formal test statistics are well-powered and the chi-square approximation is reliable.

Note: Distinguishing Identification from Inference Problems

SBC can detect both inference failures (bugs in the sampler) and identification problems (structural non-identifiability). We can distinguish them by checking MCMC diagnostics: if \(\hat{R} \approx 1\) and ESS is adequate but ranks are non-uniform, the issue is identification rather than inference. In our analysis, both models show good MCMC diagnostics, suggesting that non-uniform δ ranks in m_0 (if observed) reflect identification limitations rather than computational issues.

0.7.2 Why Non-Uniform Ranks Indicate Calibration Failure

When a parameter is poorly identified, the posterior fails to concentrate around the true value. This manifests in the SBC rank distribution as follows:

  1. The posterior is too wide relative to the prior (weak updating from data)
  2. True values tend to have central ranks (ranks cluster around the median)
  3. The rank histogram shows a characteristic peak in the middle

This pattern—if observed for δ in m_0—would be consistent with the model’s inability to distinguish between different utility functions on the basis of uncertain choices alone.
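This central-peak signature is easy to reproduce synthetically. In the sketch below, each "approximate posterior" has the right location but twice the spread it should, so true values land near the middle of the draws; the numbers are purely illustrative, not output from m_0.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_draws = 999, 999
ranks = np.empty(n_sims, dtype=int)
for i in range(n_sims):
    m = rng.normal(0.0, 1.0)                  # location of the exact posterior
    theta_true = rng.normal(m, 1.0)           # true value, as if exactly calibrated
    draws = rng.normal(m, 2.0, size=n_draws)  # approximate posterior, twice too wide
    ranks[i] = np.sum(draws < theta_true)

# Under uniformity, 20% of ranks fall in the central fifth; here far more do
central = np.mean((ranks >= 0.4 * n_draws) & (ranks < 0.6 * n_draws))
print(f"fraction of ranks in the central fifth: {central:.2f}")
```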

0.7.3 Why m_1 Is Expected to Improve Calibration

In m_1, risky choices provide direct information about utilities without confounding with subjective probabilities:

\[ \text{Risky EU: } \eta^{(r)}_s = \sum_{k=1}^K \pi_{sk} \cdot \upsilon_k \]

where π are the known objective probabilities. This is expected to improve calibration because:

  1. Risky choices constrain υ (and hence δ) directly, breaking the β–δ confound
  2. With δ better identified, uncertain choices more effectively constrain β
  3. Both choice types constrain α
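As a numerical illustration of the risky expected-utility formula above: the probabilities and utilities below are invented, and the normalization \(\upsilon_1 = 0, \upsilon_K = 1\) is an assumption made for the example, not necessarily the models' parameterization.

```python
import numpy as np

# S = 3 risky alternatives over K = 3 consequences; each row of pi sums to 1
pi = np.array([[0.5, 0.3, 0.2],
               [0.1, 0.6, 0.3],
               [1/3, 1/3, 1/3]])
upsilon = np.array([0.0, 0.4, 1.0])   # consequence utilities

eta = pi @ upsilon                     # eta_s = sum_k pi_sk * upsilon_k
print(eta)                             # [0.32, 0.54, ~0.467]
```

Because the \(\pi_{sk}\) are known, observing which risky alternative is chosen constrains \(\upsilon\) directly, with no dependence on the β parameters.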

0.8 Conclusion

Simulation-Based Calibration provides a principled framework for assessing posterior calibration, complementing the parameter recovery analysis of Report 4:

  1. m_0 faces a calibration challenge for δ: The β–δ identification problem documented in Reports 4 and 5 is expected to manifest as non-uniform SBC ranks for δ, and potentially for β as well

  2. m_1 is expected to improve δ calibration: Adding risky choices provides a direct route to utility identification, which should yield more uniform SBC ranks for δ

  3. The improvement is structural: The change is not simply about adding more data, but about the type of data—risky choices provide qualitatively different information than uncertain choices, as formalized in the Anscombe-Aumann framework (Report 5)

The degree of improvement observed in any particular run depends on the random seed and the study design. With 999 SBC simulations the formal tests are well-powered, so readers can rely on both the rank histograms / ECDF plots and the chi-square and KS statistics when drawing conclusions.

Note: On Statistical Power

With 999 SBC simulations and 20 histogram bins, the expected count per bin is approximately 50, well within the range where the chi-square approximation is reliable. The KS test can still behave conservatively with discrete rank data (since it was designed for continuous distributions), so visual inspection of rank histograms and ECDF plots remains a useful complement to the formal tests.

Citation

BibTeX citation:
@online{helzner2026,
  author = {Helzner, Jeff},
  title = {Simulation-Based {Calibration}},
  date = {2026-02-27},
  url = {https://jeffhelzner.github.io/seu-sensitivity/foundations/06_sbc_validation.html},
  langid = {en}
}
For attribution, please cite this work as:
Helzner, Jeff. 2026. “Simulation-Based Calibration.” SEU Sensitivity Project. February 27, 2026. https://jeffhelzner.github.io/seu-sensitivity/foundations/06_sbc_validation.html.